as a bidirected graph
as a bidirected graph with haplotype paths
A snarl is a subgraph bounded by two node sides that are:
A run of consecutive snarls and nodes is called a chain.
Snarls and chains can be nested inside each other.
The nested relationship of snarls and chains is described by the snarl tree.
A netgraph is a representation of a snarl with each of its child chains collapsed into a single node.
Enumerate alleles for each snarl on a reference path. Before: all paths.
Enumerate alleles for each snarl on a reference path. Now: only haplotypes.
Enumerate alleles for each snarl on a reference path inc. nested snarls.
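A minimal sketch of the idea above, assuming haplotypes are given as plain lists of node IDs traversed in forward orientation only. The function name `snarl_alleles` and the toy paths are hypothetical and are not part of vg; this is not how vg deconstruct is implemented.

```python
def snarl_alleles(haplotypes, start, end):
    """Collect the distinct traversals between two snarl boundary nodes.

    haplotypes: dict mapping a path name to its list of node IDs.
    Returns a dict mapping each traversal (tuple of node IDs) to the
    names of the paths that take it, in order of first appearance.
    """
    alleles = {}
    for name, path in haplotypes.items():
        # Skip paths that do not cross both boundaries of the snarl.
        if start not in path or end not in path:
            continue
        i, j = path.index(start), path.index(end)
        if i < j:
            traversal = tuple(path[i:j + 1])
            alleles.setdefault(traversal, []).append(name)
    return alleles


# Toy data (hypothetical): one reference path and two haplotypes.
haplotypes = {
    "ref": [1, 2, 4, 8, 9],
    "sample1#1": [1, 2, 4, 5, 6, 7, 8, 9],
    "sample1#2": [1, 3, 4, 5, 7, 9],
}

for traversal, names in snarl_alleles(haplotypes, 4, 9).items():
    print(traversal, names)
```

Listing the reference path first makes its traversal the first allele, which mirrors the REF/ALT ordering of a VCF record.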
(vg deconstruct)
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
ref 6 >1>4 GGCAC CTTAG 60 . AT=>1>2>4,>1>3>4;LV=0 GT 0|1
ref 15 >4>9 CCCAGG CCGGTAACTACCGTCACCAGG,CCGGTACGTCA 60 . AT=>4>8>9,>4>5>6>7>8>9,>4>5>7>9;LV=0 GT 1|2
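The AT field lists one graph traversal per allele, REF first. A small sketch of parsing it from an INFO string, assuming the usual convention that `>` marks a forward node visit and `<` a reverse one. The helper names `parse_traversal` and `parse_at_field` are hypothetical.

```python
import re


def parse_traversal(traversal):
    """Turn '>1>2>4' into [(1, '+'), (2, '+'), (4, '+')]."""
    steps = re.findall(r"([<>])(\d+)", traversal)
    return [(int(node), "+" if sign == ">" else "-") for sign, node in steps]


def parse_at_field(info):
    """Extract the per-allele traversals from a VCF INFO string."""
    # Split only on the first '=' because traversal values start with '>'.
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return [parse_traversal(t) for t in fields["AT"].split(",")]


for allele in parse_at_field("AT=>1>2>4,>1>3>4;LV=0"):
    print(allele)
```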
(vg deconstruct)
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
ref 10 >1>7 AAA AAAAAA,AAAA 60 . AT=>1>5>6>7,>1>2>3>4>5>6>7,>1>4>5>6>7;LV=0 GT 1|2
(vg deconstruct)
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2
ref 11 >2>5 CTTAG AAGTC 60 . AT=>2>3>5,>2>4>5;LV=0 GT 0|0 1|.
(vg deconstruct -a)
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2
ref 13 >1>6 AGGCACCTTAGCGGTAGCTTAGCATCAG AGGCACAAGTCCGGTAGCTTAGCATCAG,A 60 . AT=>1>2>3>5>6,>1>2>4>5>6,>1>6;LV=0 GT 0|0 1|2
ref 19 >2>5 CTTAG AAGTC 60 . AT=>2>3>5,>2>4>5;LV=1;PS=>1>6 GT 0|0 1|.
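With nested snarls, LV gives the depth in the snarl tree and PS points to the parent record. A rough sketch of rebuilding that nesting from the two records above, assuming records are already reduced to an ID and an INFO string. The helper names are hypothetical, and filtering on LV=0 is only a simplified way to keep top-level sites.

```python
def parse_info(info):
    """Parse 'K=V;K=V' into a dict, splitting on the first '=' only."""
    return dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)


def children_by_parent(records):
    """Map each parent snarl ID (None for top level) to its child IDs."""
    tree = {}
    for rec in records:
        parent = parse_info(rec["info"]).get("PS")  # absent at top level
        tree.setdefault(parent, []).append(rec["id"])
    return tree


def top_level(records):
    """Keep only the IDs of records at level 0 of the snarl tree."""
    return [r["id"] for r in records if parse_info(r["info"]).get("LV") == "0"]


records = [
    {"id": ">1>6", "info": "AT=>1>2>3>5>6,>1>2>4>5>6,>1>6;LV=0"},
    {"id": ">2>5", "info": "AT=>2>3>5,>2>4>5;LV=1;PS=>1>6"},
]

print(children_by_parent(records))
print(top_level(records))
```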
Trick for getting this snarl decomposition to look better (currently only for the distance index):
vg index -j [graph.dist] -w 6
      Snarl 1-5   Snarl 5-9
REF   >1>3>4>5    >5>6>7>8>9
ALT1  >1>2>4>5    >5>8>9
ALT2  >1>5        >5>6>9
Solutions?
vcfbub keeps top-level snarls, then vcfwave aligns REF vs ALT(s) sequence to enumerate variants.
Warning: inconsistent with the original pangenome.
vg giraffe
Short reads
Long reads
Input: Sequencing read and reference sequence
Output: Placement of the read on the reference and, usually, the edits between the sequences
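To make "the edits between the sequences" concrete, here is a toy sketch that aligns a read to a linear reference with a plain edit-distance table and traces back the edits. This is only an illustration of the input and output; real mappers such as vg giraffe work on graphs and use seeding and heuristics rather than a full table. The function name `align_edits` and the edit tuple layout are hypothetical.

```python
def align_edits(read, ref):
    """Globally align read to ref; return (edit distance, list of edits).

    Each edit is (kind, ref_position, ref_base, read_base) with kind in
    {'sub', 'ins', 'del'}, positions 0-based on the reference.
    """
    n, m = len(read), len(ref)
    # dp[i][j] = edit distance between read[:i] and ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitution
                           dp[i - 1][j] + 1,         # extra base in the read
                           dp[i][j - 1] + 1)         # base missing from the read

    # Trace back from the bottom-right corner to recover the edits.
    edits = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and read[i - 1] == ref[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            if cost:
                edits.append(("sub", j - 1, ref[j - 1], read[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            edits.append(("ins", j, "", read[i - 1]))
            i -= 1
        else:
            edits.append(("del", j - 1, ref[j - 1], ""))
            j -= 1
    edits.reverse()
    return dp[n][m], edits


print(align_edits("GATC", "GTTC"))
```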
Complex graphs can be slow to map to
Simplify graphs by sampling haplotypes similar to reads
On the HPRC v2 Minigraph-Cactus graph containing 462 haplotypes + 2 reference genomes
vg graph formats and indexes
Indexes
.gbwt (Graph Burrows-Wheeler Transform): haplotype paths
.gg (GBWT Graph): node sequences for a GBWT
.dist (Distance Index): snarl decomposition plus minimum distances
.zipcodes: per-node distance information used by vg giraffe
.min (Minimizer Index): minimizers used by vg giraffe
.gcsa (Generalized Compressed Suffix Array): substring index used by vg map and vg mpmap
Graphs
.gbz (GBWT + GG): the graph induced by the GBWT
.hg (/.vg) (HashGraph): graph format optimized for speed
.pg (/.vg) (PackedGraph): graph format optimized for space efficiency
.xg: older graph format
.vg: protobuf-based graph format
These slides: https://github.com/jmonlong/getapan2025
vg wiki
vg manpage: https://github.com/vgteam/vg/wiki/vg-manpage
Snarls paper doi:10.1089/cmb.2017.0251
Short read giraffe paper doi:10.1126/science.abg8871
Long read giraffe paper doi:10.1101/2025.09.29.678807
Vague understanding from Paten et al Journal of Computational Biology 2018.